import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
cf=pd.read_excel("D:\\Jigsaw\\Project-student attrition\\Student2.xlsx")
shape is a tuple that gives you an indication of the number of dimensions in the array. Here It gives number of Rows and columns
cf.shape
(3400, 56)
Pandas is a fantastic ecosystem of data-centric Python packages. Due to which is one of those packages and makes importing and analyzing data much easier.
head() method. Pandas head() method is used to return top n (5 by default) rows of a data frame or series.
cf.head()
| STUDENT IDENTIFIER | STDNT_AGE | STDNT_GENDER | STDNT_BACKGROUND | IN_STATE_FLAG | INTERNATIONAL_STS | STDNT_MAJOR | STDNT_MINOR | STDNT_TEST_ENTRANCE1 | STDNT_TEST_ENTRANCE2 | ... | DEGREE_GROUP_CD | DEGREE_GROUP_DESC | FIRST_TERM_ATTEMPT_HRS | FIRST_TERM_EARNED_HRS | SECOND_TERM_ATTEMPT_HRS | SECOND_TERM_EARNED_HRS | GROSS_FIN_NEED | COST_OF_ATTEND | EST_FAM_CONTRIBUTION | UNMET_NEED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7808615 | 18 | F | BGD 1 | Y | N | Undeclared | N | NaN | 1150.0 | ... | B | Bachelors | 16 | 16 | 14.0 | 14.0 | 0 | 0 | 0 | 0.0 |
| 1 | 7830063 | 19 | F | BGD 1 | N | N | Undeclared | N | 26.0 | NaN | ... | B | Bachelors | 18 | 18 | 18.0 | 18.0 | 570000 | 1355760 | 785760 | 459300.0 |
| 2 | 7847538 | 18 | M | BGD 1 | Y | N | Mathematics | N | NaN | 1020.0 | ... | B | Bachelors | 15 | 15 | 14.0 | 14.0 | 0 | 0 | 0 | 0.0 |
| 3 | 8006429 | 18 | M | BGD 1 | Y | N | Undeclared | N | NaN | 1210.0 | ... | B | Bachelors | 13 | 13 | 14.0 | 14.0 | 0 | 0 | 0 | 0.0 |
| 4 | 7962680 | 18 | F | BGD 1 | Y | N | Art | N | NaN | 1200.0 | ... | B | Bachelors | 12 | 12 | 12.0 | 12.0 | 835920 | 1355760 | 519840 | 278340.0 |
5 rows × 56 columns
The tail() function is used to get the last n rows. This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows. Number of rows to select by default it will show last 5 rows in python.
cf.tail()
| STUDENT IDENTIFIER | STDNT_AGE | STDNT_GENDER | STDNT_BACKGROUND | IN_STATE_FLAG | INTERNATIONAL_STS | STDNT_MAJOR | STDNT_MINOR | STDNT_TEST_ENTRANCE1 | STDNT_TEST_ENTRANCE2 | ... | DEGREE_GROUP_CD | DEGREE_GROUP_DESC | FIRST_TERM_ATTEMPT_HRS | FIRST_TERM_EARNED_HRS | SECOND_TERM_ATTEMPT_HRS | SECOND_TERM_EARNED_HRS | GROSS_FIN_NEED | COST_OF_ATTEND | EST_FAM_CONTRIBUTION | UNMET_NEED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3395 | 7948112 | 18 | F | BGD 4 | N | N | Music Performance | N | NaN | 800.0 | ... | B | Bachelors | 18 | 18 | 16.0 | 13.0 | 0 | 0 | 0 | 0.0 |
| 3396 | 8023055 | 18 | F | BGD 1 | Y | N | Biology | N | 20.0 | NaN | ... | B | Bachelors | 13 | 9 | NaN | NaN | 0 | 0 | 0 | 0.0 |
| 3397 | 7926915 | 18 | F | BGD 4 | Y | N | Biology | N | NaN | 1020.0 | ... | B | Bachelors | 14 | 6 | NaN | NaN | 0 | 945840 | 1288980 | 0.0 |
| 3398 | 7877332 | 18 | F | BGD 3 | Y | N | Joint Enrollment - Accel | N | 21.0 | NaN | ... | B | Bachelors | 12 | 9 | NaN | NaN | 0 | 269100 | 0 | -231720.0 |
| 3399 | 7928405 | 18 | M | BGD 2 | Y | N | Art | Creative Writing | NaN | 890.0 | ... | B | Bachelors | 12 | 9 | 13.0 | 6.0 | 0 | 0 | 0 | 0.0 |
5 rows × 56 columns
The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage. Inorder to print the full summary.
cf.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3400 entries, 0 to 3399 Data columns (total 56 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 STUDENT IDENTIFIER 3400 non-null int64 1 STDNT_AGE 3400 non-null int64 2 STDNT_GENDER 3400 non-null object 3 STDNT_BACKGROUND 3400 non-null object 4 IN_STATE_FLAG 3400 non-null object 5 INTERNATIONAL_STS 3400 non-null object 6 STDNT_MAJOR 3400 non-null object 7 STDNT_MINOR 3400 non-null object 8 STDNT_TEST_ENTRANCE1 1106 non-null float64 9 STDNT_TEST_ENTRANCE2 2492 non-null float64 10 STDNT_TEST_ENTRANCE_COMB 2882 non-null float64 11 FIRST_TERM 3400 non-null int64 12 CORE_COURSE_NAME_1_F 3400 non-null object 13 CORE_COURSE_GRADE_1_F 3400 non-null object 14 CORE_COURSE_NAME_2_F 3301 non-null object 15 CORE_COURSE_GRADE_2_F 3301 non-null object 16 CORE_COURSE_NAME_3_F 2835 non-null object 17 CORE_COURSE_GRADE_3_F 2835 non-null object 18 CORE_COURSE_NAME_4_F 1803 non-null object 19 CORE_COURSE_GRADE_4_F 1803 non-null object 20 CORE_COURSE_NAME_5_F 645 non-null object 21 CORE_COURSE_GRADE_5_F 645 non-null object 22 CORE_COURSE_NAME_6_F 128 non-null object 23 CORE_COURSE_GRADE_6_F 128 non-null object 24 SECOND_TERM 3400 non-null int64 25 CORE_COURSE_NAME_1_S 3243 non-null object 26 CORE_COURSE_GRADE_1_S 3168 non-null object 27 CORE_COURSE_NAME_2_S 2961 non-null object 28 CORE_COURSE_GRADE_2_S 2961 non-null object 29 CORE_COURSE_NAME_3_S 2362 non-null object 30 CORE_COURSE_GRADE_3_S 2362 non-null object 31 CORE_COURSE_NAME_4_S 1355 non-null object 32 CORE_COURSE_GRADE_4_S 1355 non-null object 33 CORE_COURSE_NAME_5_S 450 non-null object 34 CORE_COURSE_GRADE_5_S 450 non-null object 35 CORE_COURSE_NAME_6_S 81 non-null object 36 CORE_COURSE_GRADE_6_S 81 non-null object 37 HOUSING_STS 3400 non-null object 38 RETURNED_2ND_YR 3400 non-null int64 39 DISTANCE_FROM_HOME 3375 non-null float64 40 HIGH_SCHL_GPA 3347 non-null float64 41 HIGH_SCHL_NAME 3399 non-null object 42 FATHER_HI_EDU_CD 2968 non-null float64 43 FATHER_HI_EDU_DESC 3400 non-null object 44 MOTHER_HI_EDU_CD 2911 non-null float64 45 MOTHER_HI_EDU_DESC 3400 non-null object 46 DEGREE_GROUP_CD 3400 non-null object 47 DEGREE_GROUP_DESC 3400 non-null object 48 FIRST_TERM_ATTEMPT_HRS 3400 non-null int64 49 FIRST_TERM_EARNED_HRS 3400 non-null int64 50 SECOND_TERM_ATTEMPT_HRS 3194 non-null float64 51 SECOND_TERM_EARNED_HRS 3191 non-null float64 52 GROSS_FIN_NEED 3400 non-null int64 53 COST_OF_ATTEND 3400 non-null int64 54 EST_FAM_CONTRIBUTION 3400 non-null int64 55 UNMET_NEED 3400 non-null float64 dtypes: float64(10), int64(10), object(36) memory usage: 1.5+ MB
cf.columns
# It shows number of columns in entire data frame
Index(['STUDENT IDENTIFIER', 'STDNT_AGE', 'STDNT_GENDER', 'STDNT_BACKGROUND',
'IN_STATE_FLAG', 'INTERNATIONAL_STS', 'STDNT_MAJOR', 'STDNT_MINOR',
'STDNT_TEST_ENTRANCE1', 'STDNT_TEST_ENTRANCE2',
'STDNT_TEST_ENTRANCE_COMB', 'FIRST_TERM', 'CORE_COURSE_NAME_1_F',
'CORE_COURSE_GRADE_1_F', 'CORE_COURSE_NAME_2_F',
'CORE_COURSE_GRADE_2_F', 'CORE_COURSE_NAME_3_F',
'CORE_COURSE_GRADE_3_F', 'CORE_COURSE_NAME_4_F',
'CORE_COURSE_GRADE_4_F', 'CORE_COURSE_NAME_5_F',
'CORE_COURSE_GRADE_5_F', 'CORE_COURSE_NAME_6_F',
'CORE_COURSE_GRADE_6_F', 'SECOND_TERM', 'CORE_COURSE_NAME_1_S',
'CORE_COURSE_GRADE_1_S', 'CORE_COURSE_NAME_2_S',
'CORE_COURSE_GRADE_2_S', 'CORE_COURSE_NAME_3_S',
'CORE_COURSE_GRADE_3_S', 'CORE_COURSE_NAME_4_S',
'CORE_COURSE_GRADE_4_S', 'CORE_COURSE_NAME_5_S',
'CORE_COURSE_GRADE_5_S', 'CORE_COURSE_NAME_6_S',
'CORE_COURSE_GRADE_6_S', 'HOUSING_STS', 'RETURNED_2ND_YR',
'DISTANCE_FROM_HOME', 'HIGH_SCHL_GPA', 'HIGH_SCHL_NAME',
'FATHER_HI_EDU_CD', 'FATHER_HI_EDU_DESC', 'MOTHER_HI_EDU_CD',
'MOTHER_HI_EDU_DESC', 'DEGREE_GROUP_CD', 'DEGREE_GROUP_DESC',
'FIRST_TERM_ATTEMPT_HRS', 'FIRST_TERM_EARNED_HRS',
'SECOND_TERM_ATTEMPT_HRS', 'SECOND_TERM_EARNED_HRS', 'GROSS_FIN_NEED',
'COST_OF_ATTEND', 'EST_FAM_CONTRIBUTION', 'UNMET_NEED'],
dtype='object')
cf.skew()
STUDENT IDENTIFIER -0.025792 STDNT_AGE 4.416754 STDNT_TEST_ENTRANCE1 0.671944 STDNT_TEST_ENTRANCE2 -0.922564 STDNT_TEST_ENTRANCE_COMB -0.364692 FIRST_TERM -0.170108 SECOND_TERM -0.170108 RETURNED_2ND_YR -1.405151 DISTANCE_FROM_HOME 18.439510 HIGH_SCHL_GPA -0.248503 FATHER_HI_EDU_CD 0.192835 MOTHER_HI_EDU_CD -0.066167 FIRST_TERM_ATTEMPT_HRS 0.380062 FIRST_TERM_EARNED_HRS -0.902359 SECOND_TERM_ATTEMPT_HRS -0.178607 SECOND_TERM_EARNED_HRS -0.891158 GROSS_FIN_NEED 1.235763 COST_OF_ATTEND 0.345186 EST_FAM_CONTRIBUTION 3.815042 UNMET_NEED 0.665665 dtype: float64
This shows some columns are highly skewed & needs attention Highly skewed column includes, STDNT_AGE & DISTANCE_FROM_HOME As well SECOND_TERM_EARNED_HRS is highly negatively skewed
#check percentiles for some columns
quan=[]
for i in list(range(1,101)):
quan.append(i/100)
print(cf['DISTANCE_FROM_HOME'].quantile(q=quan).tail())
print(cf['STDNT_AGE'].quantile(q=quan).tail())
0.96 208.0 0.97 235.0 0.98 275.0 0.99 666.0 1.00 5932.0 Name: DISTANCE_FROM_HOME, dtype: float64 0.96 19.0 0.97 19.0 0.98 19.0 0.99 19.0 1.00 26.0 Name: STDNT_AGE, dtype: float64
#Normalise distance fomhome column to treat skewness.Normalization converts all data points to decimals between 0 and 1.
def normalize(column):
upper = column.max()
lower = column.min()
y = (column - lower)/(upper-lower)
return y
normalized = normalize(cf.DISTANCE_FROM_HOME)
normalized.describe()
count 3375.000000 mean 0.016828 std 0.039649 min 0.000000 25% 0.011632 50% 0.011632 75% 0.023264 max 1.000000 Name: DISTANCE_FROM_HOME, dtype: float64
cf.shape
(3400, 56)
#No of unique values per column
for column in cf.columns:
print(f"{column}: Number of unique values {cf[column].nunique()}")
print("==========================================================")
STUDENT IDENTIFIER: Number of unique values 3400 ========================================================== STDNT_AGE: Number of unique values 11 ========================================================== STDNT_GENDER: Number of unique values 2 ========================================================== STDNT_BACKGROUND: Number of unique values 8 ========================================================== IN_STATE_FLAG: Number of unique values 2 ========================================================== INTERNATIONAL_STS: Number of unique values 2 ========================================================== STDNT_MAJOR: Number of unique values 54 ========================================================== STDNT_MINOR: Number of unique values 37 ========================================================== STDNT_TEST_ENTRANCE1: Number of unique values 23 ========================================================== STDNT_TEST_ENTRANCE2: Number of unique values 109 ========================================================== STDNT_TEST_ENTRANCE_COMB: Number of unique values 24 ========================================================== FIRST_TERM: Number of unique values 6 ========================================================== CORE_COURSE_NAME_1_F: Number of unique values 69 ========================================================== CORE_COURSE_GRADE_1_F: Number of unique values 7 ========================================================== CORE_COURSE_NAME_2_F: Number of unique values 74 ========================================================== CORE_COURSE_GRADE_2_F: Number of unique values 7 ========================================================== CORE_COURSE_NAME_3_F: Number of unique values 74 ========================================================== CORE_COURSE_GRADE_3_F: Number of unique values 6 ========================================================== CORE_COURSE_NAME_4_F: Number of unique values 68 ========================================================== CORE_COURSE_GRADE_4_F: Number of unique values 6 ========================================================== CORE_COURSE_NAME_5_F: Number of unique values 53 ========================================================== CORE_COURSE_GRADE_5_F: Number of unique values 4 ========================================================== CORE_COURSE_NAME_6_F: Number of unique values 24 ========================================================== CORE_COURSE_GRADE_6_F: Number of unique values 4 ========================================================== SECOND_TERM: Number of unique values 6 ========================================================== CORE_COURSE_NAME_1_S: Number of unique values 66 ========================================================== CORE_COURSE_GRADE_1_S: Number of unique values 8 ========================================================== CORE_COURSE_NAME_2_S: Number of unique values 69 ========================================================== CORE_COURSE_GRADE_2_S: Number of unique values 6 ========================================================== CORE_COURSE_NAME_3_S: Number of unique values 68 ========================================================== CORE_COURSE_GRADE_3_S: Number of unique values 6 ========================================================== CORE_COURSE_NAME_4_S: Number of unique values 64 ========================================================== CORE_COURSE_GRADE_4_S: Number of unique values 5 ========================================================== CORE_COURSE_NAME_5_S: Number of unique values 51 ========================================================== CORE_COURSE_GRADE_5_S: Number of unique values 5 ========================================================== CORE_COURSE_NAME_6_S: Number of unique values 26 ========================================================== CORE_COURSE_GRADE_6_S: Number of unique values 3 ========================================================== HOUSING_STS: Number of unique values 2 ========================================================== RETURNED_2ND_YR: Number of unique values 2 ========================================================== DISTANCE_FROM_HOME: Number of unique values 71 ========================================================== HIGH_SCHL_GPA: Number of unique values 200 ========================================================== HIGH_SCHL_NAME: Number of unique values 439 ========================================================== FATHER_HI_EDU_CD: Number of unique values 4 ========================================================== FATHER_HI_EDU_DESC: Number of unique values 5 ========================================================== MOTHER_HI_EDU_CD: Number of unique values 4 ========================================================== MOTHER_HI_EDU_DESC: Number of unique values 5 ========================================================== DEGREE_GROUP_CD: Number of unique values 3 ========================================================== DEGREE_GROUP_DESC: Number of unique values 3 ========================================================== FIRST_TERM_ATTEMPT_HRS: Number of unique values 12 ========================================================== FIRST_TERM_EARNED_HRS: Number of unique values 22 ========================================================== SECOND_TERM_ATTEMPT_HRS: Number of unique values 21 ========================================================== SECOND_TERM_EARNED_HRS: Number of unique values 23 ========================================================== GROSS_FIN_NEED: Number of unique values 927 ========================================================== COST_OF_ATTEND: Number of unique values 120 ========================================================== EST_FAM_CONTRIBUTION: Number of unique values 1236 ========================================================== UNMET_NEED: Number of unique values 1197 ==========================================================
demo1 = []
for column in cf.columns:
if 1 < cf[column].nunique() <= 2:
# if len(cf[column].unique()) == 2:
# if raw1[column].dtype == object and len(raw1[column].unique()) <= 20:
demo1.append(column)
print(f"{column} : {cf[column].unique()}")
print(cf[column].value_counts())
print("====================================")
demo1.remove('RETURNED_2ND_YR')
STDNT_GENDER : ['F' 'M'] F 2140 M 1260 Name: STDNT_GENDER, dtype: int64 ==================================== IN_STATE_FLAG : ['Y' 'N'] Y 3063 N 337 Name: IN_STATE_FLAG, dtype: int64 ==================================== INTERNATIONAL_STS : ['N' 'Y'] N 3373 Y 27 Name: INTERNATIONAL_STS, dtype: int64 ==================================== HOUSING_STS : ['On Campus' 'Off Campus'] Off Campus 1981 On Campus 1419 Name: HOUSING_STS, dtype: int64 ==================================== RETURNED_2ND_YR : [0 1] 1 2677 0 723 Name: RETURNED_2ND_YR, dtype: int64 ====================================
len(demo1)
4
demo2 = []
for column in cf.columns:
if 2 < cf[column].nunique() <= 20:
print(f"{column} : {cf[column].unique()}")
demo2.append(column)
print("====================================")
STDNT_AGE : [18 19 17 20 16 21 23 25 22 26 24] ==================================== STDNT_BACKGROUND : ['BGD 1' 'BGD 3' 'BGD 6' 'BGD 4' 'BGD 8' 'BGD 2' 'BGD 7' 'BGD 5'] ==================================== FIRST_TERM : [200708 200808 201008 200508 200608 200908] ==================================== CORE_COURSE_GRADE_1_F : ['A' 'B' 'C' 'NOT REP' 'D' 'F' 'INCOMPL'] ==================================== CORE_COURSE_GRADE_2_F : ['A' 'B' 'C' 'D' 'F' nan 'NOT REP' 'INCOMPL'] ==================================== CORE_COURSE_GRADE_3_F : ['A' nan 'B' 'C' 'D' 'F' 'INCOMPL'] ==================================== CORE_COURSE_GRADE_4_F : ['A' nan 'B' 'C' 'D' 'F' 'INCOMPL'] ==================================== CORE_COURSE_GRADE_5_F : [nan 'A' 'B' 'C' 'D'] ==================================== CORE_COURSE_GRADE_6_F : [nan 'A' 'B' 'C' 'D'] ==================================== SECOND_TERM : [200802 200902 201102 200602 200702 201002] ==================================== CORE_COURSE_GRADE_1_S : ['A' 'B' 'C' 'D' 'F' 'NOT REP' 'INCOMPL' 'Unknown' nan] ==================================== CORE_COURSE_GRADE_2_S : ['A' 'B' 'C' 'D' 'F' 'INCOMPL' nan] ==================================== CORE_COURSE_GRADE_3_S : ['A' nan 'B' 'C' 'D' 'F' 'INCOMPL'] ==================================== CORE_COURSE_GRADE_4_S : ['A' nan 'B' 'C' 'D' 'F'] ==================================== CORE_COURSE_GRADE_5_S : [nan 'A' 'B' 'C' 'D' 'F'] ==================================== CORE_COURSE_GRADE_6_S : [nan 'A' 'B' 'C'] ==================================== FATHER_HI_EDU_CD : [nan 3. 2. 1. 4.] ==================================== FATHER_HI_EDU_DESC : ['No Information' 'College/Beyond' 'High School' 'Middle School/Junior High' 'Other/Unknown'] ==================================== MOTHER_HI_EDU_CD : [nan 3. 4. 2. 1.] ==================================== MOTHER_HI_EDU_DESC : ['No Information' 'College/Beyond' 'Other/Unknown' 'High School' 'Middle School/Junior High'] ==================================== DEGREE_GROUP_CD : ['B' 'A' 'V'] ==================================== DEGREE_GROUP_DESC : ['Bachelors' 'Associate' 'Career Associate'] ==================================== FIRST_TERM_ATTEMPT_HRS : [16 18 15 13 12 17 14 19 20 21 9 11] ====================================
len(demo2)
23
demo3 = []
for column in cf.columns:
if 20 < cf[column].nunique() <= 100:
print(f"{column} : {cf[column].unique()}")
demo3.append(column)
print("====================================")
STDNT_MAJOR : ['Undeclared' 'Mathematics' 'Art' 'Political Science' 'Biology' 'Accounting' 'Criminal Justice' 'Psychology' 'Theatre Arts' 'Nursing' 'Pre-Nursing' 'Chemistry' 'Pre-Engineering/RETP' 'Early Childhood Education' 'Music Performance' 'English Language/Literature' 'Joint Enrollment - Accel' 'History and Secondary Ed' 'History' 'Engineering Studies' 'Computer Science - Systems' 'Pre-Business' 'Management' 'Theatre Education' 'Marketing' 'Communication' 'Music Education' 'French with Teacher Cert' 'English and Secondary Ed' 'Exercise Science' 'General Business' 'Mathematics and Secondary Ed' 'Finance' 'French' 'Health Science' 'Spec Ed: Gen. Curr. - Reading' 'Music' 'Middle Grades Education' 'Geology' 'Management Information Systems' 'Art Education' 'Applied Computer Science' 'Information Technology' 'Sociology' 'Computer Science - Games' 'Spanish' 'General Studies/AS' 'Biology and Secondary Ed' 'Chemistry and Secondary Ed' 'Health and Physical Education' 'Liberal Arts' 'Spanish with Teacher Cert' 'Earth and Space Science' 'Early Admission - Accel'] ==================================== STDNT_MINOR : ['N' 'Theatre Arts' 'Foundations of Business' 'Spanish' 'Professional Writing' 'Computer Science - Systems' 'Art' 'Mathematics' 'Chemistry' 'Psychology' "Women's Studies" 'Creative Writing' 'Biology' 'History' 'Early Childhood Education' 'Health Science' 'Accounting' 'Political Science' 'Communication' 'Asian Studies' 'Criminal Justice' 'English Language/Literature' 'Art History' 'Marketing' 'French' 'Economics' 'Sociology' 'International Business' 'Music' 'Exercise Science' 'Finance' 'Computer Info. Management' 'Military Sci & Adv Leadership' 'African Studies' 'Management' 'Anthropology' 'Philosophy'] ==================================== STDNT_TEST_ENTRANCE1 : [nan 26. 27. 28. 23. 21. 19. 31. 25. 20. 14. 22. 30. 24. 18. 32. 29. 16. 15. 17. 8. 11. 13. 12.] ==================================== STDNT_TEST_ENTRANCE_COMB : [1150. 1190. 1030. 1220. 1070. 590. 740. nan 1260. 1110. 990. 910. 1300. 640. 530. 1340. 1380. 950. 690. 870. 790. 1420. 830. 1510. 1460.] ==================================== CORE_COURSE_NAME_1_F : ['ANTH 1105' 'ANTH 1107' 'ARAB 1001' 'ARTH 1100' 'BIOL 1125' 'BIOL 1215K' 'CHEM 1151' 'CHEM 1211' 'COMM 1110' 'COMM 1110H' 'CPSC 1105' 'CPSC 1301' 'ECON 2105' 'ECON 2106' 'ENGL 1101' 'ENGL 1102' 'FREN 1001' 'HIST 1111' 'HIST 2111' 'HIST 2112' 'LEAD 1705' 'MATH 1101' 'MUSC 1100' 'CHEM 1151L' 'ENVS 1105' 'GEOL 1121L' 'HIST 1112' 'HIST 1112I' 'ITDS 1155' 'MATH 1111' 'POLS 1101' 'THEA 1100' 'MATH 1131' 'GEOG 1101I' 'ASTR 1105' 'PSYC 1101' 'ENGL 1101I' 'MATH 1132' 'MATH 1113' 'GEOL 1121' 'ARTH 1100I' 'ITDS 1145' 'ENVS 1105I' 'ITDS 2735' 'POLS 1101H' 'SOCI 1101' 'SPAN 1001' 'CHEM 1211L' 'GEOL 1110' 'STAT 1127' 'MATH 1125' 'CPSC 1301L' 'PHYS 1125' 'ANTH 1105I' 'THEA 1100I' 'COMM 1110I' 'ITDS 2748' 'SPAN 1002' 'PHYS 2311' 'MATH 2125' 'FREN 1002' 'GERM 1002' 'ITDS 2749' 'ASTR 1305' 'ITDS 2726' 'POLS 2401' 'ITDS 2727' 'ITDS 1146' 'ENGL 2111'] ==================================== CORE_COURSE_NAME_2_F : ['ARTH 1100' 'GERM 1002' 'ENGL 1101' 'COMM 1110' 'HIST 2112' 'CHEM 1211' 'ENGL 1102' 'ECON 2105' 'ITDS 2735' 'CHEM 1151L' 'CHEM 1211L' 'MATH 1111' 'FREN 1002' 'ENGL 2111' 'ENVS 1105' 'HIST 1111' 'HIST 2111' 'CHEM 1151' 'BIOL 1215K' 'CPSC 1301L' 'MATH 1113' 'SOCI 1101' 'SPAN 1002' 'MATH 1131' 'MUSC 1100' 'LEAD 1705' 'HIST 1112' 'PSYC 1101' 'THEA 1100' 'POLS 1101' 'CPSC 1105' 'CPSC 1301' 'ITDS 1145' 'FREN 2001' 'GEOL 1121' 'ARTH 1100I' 'COMM 1110I' 'ECON 2106' 'ANTH 1105' 'COMM 1110H' 'ITDS 1155' 'POLS 1101H' 'MATH 1125' 'HIST 1112I' 'SPAN 1001' 'BIOL 1125' 'ASTR 1105' 'PHYS 2311' 'GEOG 1101I' nan 'ENGL 1101I' 'FREN 1001' 'STAT 1127' 'MATH 2115' 'ITDS 1156' 'POLS 2401' 'MATH 1101' 'GEOL 1121L' 'GEOL 1110' 'PHYS 1125' 'JAPN 1001' 'ENGL 2136' 'ITDS 2726' 'ITDS 2749' 'ANTH 1105I' 'ITDS 1146' 'MATH 1132' 'ANTH 1107' 'ENVS 1105I' 'PHYS 1325' 'ITDS 2748' 'THEA 1100I' 'ASTR 1305' 'ITDS 2727' 'MATH 2125'] ==================================== CORE_COURSE_NAME_3_F : ['ENGL 1102' 'ITDS 2735' 'HIST 2111' nan 'CPSC 1105' 'MATH 1101' 'CHEM 1211L' 'ENGL 1101' 'HIST 1111' 'MATH 1132' 'POLS 2401' 'MATH 1113' 'STAT 1127' 'MATH 1111' 'ECON 2105' 'COMM 1110' 'FREN 1001' 'HIST 2112' 'POLS 1101H' 'LEAD 1705' 'MUSC 1100' 'PSYC 1101' 'ITDS 2749' 'THEA 1100' 'HIST 1112' 'ITDS 2726H' 'CHEM 1151L' 'ITDS 1155' 'POLS 1101' 'SOCI 1101' 'SPAN 1002' 'MATH 1125' 'CPSC 1301L' 'ITDS 1145' 'ENGL 1101I' 'ITDS 2726' 'CHEM 1211' 'CHEM 1151' 'BIOL 1215K' 'ARTH 1100' 'MATH 1131' 'JAPN 1001' 'ENVS 1105I' 'SPAN 1001' 'ENVS 1105' 'HIST 1112I' 'COMM 1110H' 'GEOL 1110' 'THEA 1100I' 'ECON 2106' 'PHYS 1125' 'MATH 2125' 'ANTH 1105' 'CPSC 1301' 'FREN 1002' 'GEOG 1101I' 'ITDS 2727' 'ENGL 2111' 'ASTR 1105' 'BIOL 1225K' 'PHYS 1325' 'ASTR 1305' 'ITDS 2748' 'ITDS 2746' 'GEOL 1121L' 'BIOL 1125' 'MATH 2115' 'ARTH 1100I' 'ENGL 2136' 'COMM 1110I' 'ITDS 1146' 'GEOL 1121' 'ANTH 1107' 'PHYS 2211' 'LATN 1002'] ==================================== CORE_COURSE_NAME_4_F : ['PSYC 1101' 'MATH 1125' 'MATH 1111' nan 'MUSC 1100' 'ENGL 1101' 'POLS 1101' 'MATH 1101' 'ECON 2105' 'HIST 1111' 'HIST 2111' 'LEAD 1705' 'HIST 2112' 'MATH 1113' 'THEA 1100' 'ITDS 2749' 'SOCI 1101' 'MATH 2115' 'ENGL 1102' 'ITDS 2735' 'SPAN 1002' 'SPAN 1001' 'PHYS 1125' 'ITDS 2748' 'MATH 1131' 'ECON 2106' 'CHEM 1211L' 'CHEM 1151L' 'CHEM 1151' 'COMM 1110' 'ITDS 2726H' 'ARTH 1100I' 'PHYS 2211' 'POLS 2401' 'GEOG 1101I' 'STAT 1127' 'ITDS 1155' 'PHYS 1325' 'FREN 2001' 'FREN 1001' 'CHEM 1211' 'CPSC 1301L' 'CPSC 1301' 'BIOL 1215K' 'JAPN 1001' 'ARTH 1100' 'HIST 1112' 'COMM 1110H' 'ENGL 1101I' 'CPSC 1105' 'THEA 1100I' 'ITDS 2726' 'ANTH 1105' 'ITDS 2746' 'HIST 1112I' 'ITDS 2727' 'GEOL 1121L' 'ENGL 2136' 'ENVS 1105I' 'ASTR 1305' 'MATH 1132' 'ITDS 1146' 'GEOL 1110' 'COMM 1110I' 'BIOL 1125' 'ENVS 1105' 'PHYS 2311' 'GERM 1002' 'POLS 1101H'] ==================================== CORE_COURSE_NAME_5_F : [nan 'THEA 1100' 'SOCI 1101' 'PSYC 1101' 'HIST 2111' 'POLS 1101' 'ENGL 1101' 'MATH 1101' 'MUSC 1100' 'MATH 1111' 'STAT 1127' 'POLS 1101H' 'JAPN 1001' 'SPAN 1001' 'POLS 2401' 'ITDS 2749' 'CPSC 1105' 'ITDS 2748' 'SPAN 1002' 'MATH 1131' 'ITDS 1156' 'Life and Career Planning' 'ITDS 2735' 'COMM 1110' 'CHEM 1211L' 'MATH 1113' 'PHYS 1325' 'HIST 1112I' 'HIST 2112' 'ITDS 1145' 'ENGL 1102' 'LEAD 1705' 'ECON 2105' 'CHEM 1151L' 'CHEM 1211' 'ARTH 1100' 'ITDS 2726' 'CPSC 1301L' 'ECON 2106' 'FREN 1001' 'MATH 1125' 'BIOL 1215K' 'ENVS 1105I' 'GEOL 1110' 'FREN 1002' 'GEOL 1121' 'ITDS 1146' 'MATH 2115' 'COMM 1110I' 'HIST 1111' 'HIST 1112' 'CPSC 1301' 'ITDS 2746' 'CHEM 1151'] ==================================== CORE_COURSE_NAME_6_F : [nan 'POLS 1101' 'MATH 1113' 'MATH 1111' 'LEAD 1705' 'SOCI 1101' 'MUSC 1100' 'U.S. History to 1865' 'MATH 1101' 'ENGL 1101' 'ITDS 2749' 'CHEM 1151L' 'PSYC 1101' 'COMM 1110' 'HIST 1112' 'THEA 1100' 'ITDS 2735' 'HIST 2112' 'SPAN 1001' 'CHEM 1211L' 'HIST 2111' 'CPSC 1301L' 'MATH 1131' 'FREN 2002' 'ITDS 2748'] ==================================== CORE_COURSE_NAME_1_S : ['CHEM 1151' 'ECON 2105' 'ENGL 1102' 'COMM 1110' 'BIOL 1215K' 'ECON 2106' 'SPAN 1001' 'MATH 1101' 'ARTH 1100' 'BIOL 1225K' 'HIST 2112' 'COMM 1110H' 'CPSC 1105' 'HIST 2111' 'ITDS 2726' 'MATH 1111' 'GEOG 1101I' 'ANTH 1107' 'ENGL 1101' 'ASTR 1105' 'COMM 1110I' 'ANTH 1105I' 'BIOL 1125' 'CPSC 1301' 'ANTH 1105' 'STAT 1127' 'CHEM 1211' 'MUSC 1100' 'FREN 1001' 'SPAN 1002' 'ENGL 2111' 'GEOL 1121' 'ITDS 1145' 'ENVS 1105' 'GEOL 1121L' 'POLS 1101' 'PSYC 1101' 'THEA 1100' 'MATH 1131' 'ENGL 2136' 'MATH 1113' 'FREN 1002' 'CHEM 1151L' 'HIST 1111' 'ITDS 1156' 'GEOL 1110' 'ITDS 2735' 'MATH 1132' 'POLS 2401' 'PHYS 1125' 'SOCI 1101' 'CHEM 1211L' 'MATH 2125' 'FREN 2001' 'HIST 1112' 'HIST 1112I' 'ITDS 1155' 'FREN 2002' 'ENGL 1101I' 'MATH 1125' 'MATH 2115' 'CPSC 1301L' 'ITDS 2749' 'ITDS 2727' 'POLS 1101H' 'ITDS 2748' nan] ==================================== CORE_COURSE_NAME_2_S : ['CHEM 1151L' 'ENGL 1102' 'ENVS 1105' 'HIST 2112' 'ENGL 1101' 'ITDS 2735' 'HIST 2111' 'STAT 1127' 'THEA 1100' 'CHEM 1211' 'MUSC 1100' 'PSYC 1101' 'COMM 1110' 'FREN 1002' 'MATH 1131' 'COMM 1110H' 'FREN 2001' 'MATH 1111' 'SPAN 1001' 'POLS 1101' 'ITDS 2727' 'ECON 2105' 'POLS 2401' 'MATH 1132' 'MATH 1125' 'MATH 1101' 'GEOG 1101I' 'MATH 1113' 'ITDS 2748' 'ITDS 1145' 'PHYS 1125' 'CHEM 1151' 'BIOL 1215K' 'HIST 1112I' 'GEOL 1110' 'CPSC 1301L' 'ENGL 2111' 'ITDS 1156' 'ECON 2106' 'ITDS 1146' 'HIST 1112' 'CHEM 1211L' 'SOCI 1101' 'CPSC 1105' 'SPAN 1002' 'LEAD 1705' 'ENGL 2136' 'GEOL 1121L' 'ITDS 2726' 'ARTH 1100' 'FREN 1001' 'COMM 1110I' 'BIOL 1125' 'ANTH 1105' 'ANTH 1105I' 'CPSC 1301' 'ASTR 1105' 'ITDS 1155' 'ITDS 2749' 'HIST 1111' 'GEOL 1121' 'ANTH 1107' 'POLS 1101H' 'ENGL 1101I' 'GEOL 2225' 'MATH 2115' 'BIOL 1225K' 'MATH 2125' 'LATN 1002' nan] ==================================== CORE_COURSE_NAME_3_S : ['COMM 1110' nan 'MUSC 1100' 'LEAD 1705' 'HIST 2111' 'ITDS 1156' 'MATH 1111' 'HIST 1112' 'POLS 1101H' 'CHEM 1211L' 'ENGL 1102' 'PSYC 1101' 'POLS 1101' 'ENVS 1105' 'MATH 1113' 'ITDS 1145' 'ITDS 2735' 'HIST 2112' 'MATH 1125' 'ITDS 2726' 'THEA 1100' 'MATH 1101' 'SPAN 1001' 'SOCI 1101' 'MATH 1131' 'ENGL 1101' 'ENGL 2136' 'MATH 1132' 'GEOL 1110' 'CHEM 1151L' 'ITDS 2727' 'GEOL 1121' 'ASTR 1105' 'FREN 2001' 'STAT 1127' 'PHYS 1125' 'ECON 2105' 'PHYS 1325' 'ITDS 2749' 'ECON 2106' 'SPAN 1002' 'HIST 1111' 'BIOL 1215K' 'ITDS 2746' 'FREN 1001' 'ITDS 1155' 'ITDS 2748' 'FREN 1002' 'CHEM 1151' 'FREN 2002' 'POLS 2401' 'GEOL 1121L' 'GEOG 1101I' 'ARTH 1100' 'CPSC 1105' 'ANTH 1105I' 'CHEM 1211' 'CPSC 1301' 'ANTH 1107' 'BIOL 1125' 'ITDS 1146' 'MATH 2125' 'HIST 1112I' 'ENGL 2111' 'ANTH 1105' 'ENGL 1101I' 'CPSC 1301L' 'MATH 2115' 'COMM 1110I'] ==================================== CORE_COURSE_NAME_4_S : ['ITDS 2749' nan 'MUSC 1100' 'PSYC 1101' 'POLS 1101' 'MATH 1125' 'SPAN 1001' 'HIST 1111' 'ENGL 1102' 'HIST 2112' 'POLS 2401' 'MATH 1101' 'ITDS 2735' 'HIST 2111' 'MATH 1111' 'HIST 1112' 'STAT 1127' 'ITDS 1145' 'GEOL 1121L' 'COMM 1110' 'THEA 1100' 'LEAD 1705' 'ITDS 1155' 'MATH 1113' 'MATH 1132' 'ENGL 1101' 'ECON 2106' 'MATH 1131' 'SOCI 1101' 'ITDS 2726' 'ITDS 2746' 'PHYS 1125' 'ITDS 2748' 'FREN 2001' 'SPAN 1002' 'ITDS 1146' 'CHEM 1211L' 'CPSC 1301L' 'FREN 1002' 'CHEM 1211' 'ARTH 1100' 'CPSC 1105' 'FREN 1001' 'GEOG 1101I' 'CHEM 1151' 'BIOL 1215K' 'CHEM 1151L' 'ECON 2105' 'GEOL 1110' 'COMM 1110H' 'ASTR 1105' 'COMM 1110I' 'ITDS 2727' 'MATH 2125' 'PHYS 1325' 'ANTH 1105I' 'ENVS 1105' 'ENGL 2136' 'BIOL 1225K' 'ANTH 1105' 'GEOL 1121' 'CPSC 1301' 'ITDS 1156' 'HIST 1112I' 'ENGL 1101I'] ==================================== CORE_COURSE_NAME_5_S : [nan 'MATH 1113' 'MATH 1111' 'ITDS 2749' 'POLS 1101' 'MUSC 1100' 'PSYC 1101' 'ITDS 2748' 'SPAN 1002' 'FREN 1001' 'ENGL 2136' 'MATH 1101' 'THEA 1100' 'SPAN 1001' 'ITDS 2746' 'MATH 2125' 'ITDS 1145' 'STAT 1127' 'ENGL 1102' 'ITDS 2735' 'ITDS 1146' 'SOCI 1101' 'MATH 1132' 'ITDS 2726' 'CPSC 1301L' 'ITDS 2727' 'COMM 1110' 'HIST 2112' 'CHEM 1211L' 'CHEM 1151L' 'ENVS 1105' 'MATH 1125' 'HIST 2111' 'CHEM 1151' 'CPSC 1105' 'GEOL 1121L' 'ECON 2105' 'HIST 1112' 'ITDS 1156' 'MATH 1131' 'ECON 2106' 'ENGL 1101' 'ITDS 1155' 'PHYS 1125' 'LEAD 1705' 'GEOL 1110' 'POLS 2401' 'FREN 1002' 'ARTH 1100' 'HIST 1111' 'CPSC 1301' 'BIOL 1215K'] ==================================== CORE_COURSE_NAME_6_S : [nan 'PSYC 1101' 'POLS 1101' 'HIST 2112' 'STAT 1127' 'SPAN 1001' 'MUSC 1100' 'ENGL 1101' 'SPAN 1002' 'MATH 1101' 'ENGL 1102' 'CHEM 1151L' 'ITDS 2746' 'ITDS 2735' 'LEAD 1705' 'SOCI 1101' 'MATH 1111' 'ITDS 2727' 'MATH 1125' 'ITDS 2749' 'THEA 1100' 'COMM 1110' 'HIST 2111' 'CPSC 1105' 'ITDS 2748' 'CPSC 1301L' 'CHEM 1151'] ==================================== DISTANCE_FROM_HOME : [ 150. 69. 0. 4501. 661. 138. 283. 90. 188. 614. nan 59. 997. 181. 666. 190. 176. 222. 208. 778. 235. 513. 215. 91. 310. 246. 722. 137. 572. 1854. 117. 136. 5932. 237. 441. 274. 270. 301. 703. 398. 275. 225. 355. 299. 871. 711. 151. 1270. 2221. 528. 694. 316. 477. 1735. 574. 421. 1078. 769. 1222. 330. 382. 517. 403. 609. 1161. 556. 1001. 919. 2203. 366. 786. 1216.] ==================================== FIRST_TERM_EARNED_HRS : [16 18 15 13 12 17 14 10 9 11 8 7 6 19 21 0 2 4 3 5 20 1] ==================================== SECOND_TERM_ATTEMPT_HRS : [14. 18. 12. 16. 15. 13. 17. 7. 19. 9. 6. 21. 20. 11. 23. 10. 22. 8. 4. nan 3. 2.] ==================================== SECOND_TERM_EARNED_HRS : [14. 18. 12. 16. 15. 13. 17. 7. 19. 8. 9. 10. 11. 6. 21. 20. 23. nan 3. 5. 4. 0. 2. 1.] ====================================
len(demo3)
20
demo4 = []
for column in cf.columns:
if cf[column].nunique() > 100:
print(f"{column} : {cf[column].unique()}")
demo4.append(column)
print("====================================")
STUDENT IDENTIFIER : [7808615 7830063 7847538 ... 7926915 7877332 7928405]
====================================
STDNT_TEST_ENTRANCE2 : [1150. nan 1020. 1210. 1200. 1050. 580. 750. 1060. 1250. 1230. 1220.
1040. 1110. 930. 1310. 1300. 610. 1120. 650. 550. 620. 600. 590.
1130. 1350. 1080. 510. 1330. 1170. 1140. 1180. 1270. 1160. 1010. 1000.
1360. 1190. 640. 500. 740. 1260. 1100. 1390. 1070. 690. 940. 560.
980. 460. 920. 950. 410. 990. 1290. 1280. 1240. 1030. 900. 1090.
570. 480. 780. 890. 1370. 970. 450. 1320. 960. 1420. 1400. 520.
530. 630. 1340. 490. 880. 840. 540. 910. 830. 860. 470. 820.
870. 760. 850. 810. 400. 800. 1430. 770. 660. 420. 700. 430.
390. 1380. 370. 440. 1490. 790. 730. 670. 720. 360. 350. 1440.
1410. 710.]
====================================
HIGH_SCHL_GPA : [4. 2.89 3.39 2.93 3.86 3.93 3.69 3.34 3.96 3.35
nan 3.5 2.4 3.71 3.72 3.73 3.12 3.57 3.81 3.9
3.92 3.43 3.6 2.97 3.58 3.79 3.54 3.46 3.67 2.9
3.63 3.8 3.75 3.56 3.53 3.78 3.33 3. 3.61 3.3
3.21 3.37 2.66 3.82 3.62 2.88 3.87 3.88 3.26 3.7
3.74 2.8 2.63 3.47 3.52 3.68 3.25 3.28 3.06 3.66
3.64 3.4375 3.85 3.83 3.07 2.96 3.36 3.16 3.15 2.76
3.31 3.77 3.13 3.4 3.18 3.27 3.84 3.19 2.92 3.94
2.73 3.14 2.81 2.64 3.08 3.09 3.65 3.38 2.51 3.1
3.97 3.2 2.6 3.17 3.23 2.84 2.87 3.42 3.03 2.77
3.55 2.72 2.79 2.94 3.76 2.58 3.04 2.61 2.85 2.86
2.71 3.32 2.82 3.95 2.21 2.33 2.78 2.62 2.54 3.29
2.83 2.75 2.57 2.31 2.5 2.38 3.11 2.26 2.74 2.45
3.59 2.44 2.37 2.7 3.48 2.43 2.1 3.22 2.42 2.46
2.25 2.69 3.875 2.923 2. 3.44 3.41 2.04 2.65 3.625
2.67 2.68 2.47 3.24 2.36 2.59 2.656 2.29 2.53 2.32
2.41 2.875 2.56 3.9375 2.91 2.07 2.55 2.52 2.3 2.375
2.35 2.2 3.45 2.13 2.17 3.01 2.15 2.687 2.95 2.18
3.51 3.688 2.27 2.625 2.28 1.92 2.23 2.48 2.06 2.39
2.24 0. 2.34 2.813 3.148 2.16 3.89 2.08 1.87 3.6875
1.94 ]
====================================
HIGH_SCHL_NAME : ['SCHOOL 50' 'SCHOOL 389' 'SCHOOL 82' 'SCHOOL 180' 'SCHOOL 2' 'SCHOOL 5'
'SCHOOL 130' 'SCHOOL 54' 'SCHOOL 189' 'SCHOOL 342' 'SCHOOL 11'
'SCHOOL 28' 'SCHOOL 19' 'SCHOOL 98' 'SCHOOL 14' 'SCHOOL 121' 'SCHOOL 15'
'SCHOOL 283' 'SCHOOL 304' 'SCHOOL 61' 'SCHOOL 229' 'SCHOOL 148'
'SCHOOL 298' 'SCHOOL 175' 'SCHOOL 26' 'SCHOOL 88' 'SCHOOL 131'
'SCHOOL 33' 'SCHOOL 128' 'SCHOOL 8' 'SCHOOL 223' 'SCHOOL 27' 'SCHOOL 52'
'SCHOOL 29' 'SCHOOL 70' 'SCHOOL 20' 'SCHOOL 184' 'SCHOOL 453' 'SCHOOL 72'
'SCHOOL 7' 'SCHOOL 25' 'SCHOOL 1' 'SCHOOL 18' 'SCHOOL 9' 'SCHOOL 216'
'SCHOOL 10' 'SCHOOL 311' 'SCHOOL 34' 'SCHOOL 43' 'SCHOOL 45' 'SCHOOL 74'
'SCHOOL 423' 'SCHOOL 132' 'SCHOOL 46' 'SCHOOL 47' 'SCHOOL 200'
'SCHOOL 83' 'SCHOOL 292' 'SCHOOL 129' 'SCHOOL 66' 'SCHOOL 12'
'SCHOOL 319' 'SCHOOL 3' 'SCHOOL 85' 'SCHOOL 49' 'SCHOOL 339' 'SCHOOL 23'
'SCHOOL 111' 'SCHOOL 366' 'SCHOOL 440' 'SCHOOL 141' 'SCHOOL 13'
'SCHOOL 48' 'SCHOOL 91' 'SCHOOL 22' 'SCHOOL 236' 'SCHOOL 263' 'SCHOOL 37'
'SCHOOL 95' 'SCHOOL 55' 'SCHOOL 125' 'SCHOOL 196' 'SCHOOL 277'
'SCHOOL 367' 'SCHOOL 317' 'SCHOOL 159' 'SCHOOL 38' 'SCHOOL 4' 'SCHOOL 87'
'SCHOOL 78' 'SCHOOL 16' 'SCHOOL 336' 'SCHOOL 149' 'SCHOOL 155'
'SCHOOL 177' 'SCHOOL 523' 'SCHOOL 285' 'SCHOOL 171' 'SCHOOL 65'
'SCHOOL 262' 'SCHOOL 410' 'SCHOOL 30' 'SCHOOL 17' 'SCHOOL 230'
'SCHOOL 57' 'SCHOOL 42' 'SCHOOL 284' 'SCHOOL 114' 'SCHOOL 386' nan
'SCHOOL 53' 'SCHOOL 419' 'SCHOOL 226' 'SCHOOL 340' 'SCHOOL 288'
'SCHOOL 154' 'SCHOOL 274' 'SCHOOL 6' 'SCHOOL 123' 'SCHOOL 384'
'SCHOOL 515' 'SCHOOL 163' 'SCHOOL 505' 'SCHOOL 327' 'SCHOOL 24'
'SCHOOL 295' 'SCHOOL 434' 'SCHOOL 416' 'SCHOOL 502' 'SCHOOL 477'
'SCHOOL 93' 'SCHOOL 158' 'SCHOOL 312' 'SCHOOL 412' 'SCHOOL 219'
'SCHOOL 310' 'SCHOOL 79' 'SCHOOL 73' 'SCHOOL 272' 'SCHOOL 120'
'SCHOOL 153' 'SCHOOL 486' 'SCHOOL 315' 'SCHOOL 135' 'SCHOOL 100'
'SCHOOL 234' 'SCHOOL 273' 'SCHOOL 137' 'SCHOOL 179' 'SCHOOL 428'
'SCHOOL 452' 'SCHOOL 44' 'SCHOOL 152' 'SCHOOL 115' 'SCHOOL 139'
'SCHOOL 398' 'SCHOOL 99' 'SCHOOL 140' 'SCHOOL 369' 'SCHOOL 157'
'SCHOOL 333' 'SCHOOL 183' 'SCHOOL 127' 'SCHOOL 150' 'SCHOOL 126'
'SCHOOL 258' 'SCHOOL 268' 'SCHOOL 278' 'SCHOOL 259' 'SCHOOL 41'
'SCHOOL 143' 'SCHOOL 314' 'SCHOOL 489' 'SCHOOL 119' 'SCHOOL 465'
'SCHOOL 395' 'SCHOOL 31' 'SCHOOL 276' 'SCHOOL 84' 'SCHOOL 108'
'SCHOOL 69' 'SCHOOL 62' 'SCHOOL 356' 'SCHOOL 232' 'SCHOOL 467'
'SCHOOL 521' 'SCHOOL 35' 'SCHOOL 397' 'SCHOOL 107' 'SCHOOL 151'
'SCHOOL 334' 'SCHOOL 162' 'SCHOOL 270' 'SCHOOL 293' 'SCHOOL 144'
'SCHOOL 71' 'SCHOOL 192' 'SCHOOL 133' 'SCHOOL 308' 'SCHOOL 378'
'SCHOOL 245' 'SCHOOL 519' 'SCHOOL 483' 'SCHOOL 104' 'SCHOOL 89'
'SCHOOL 279' 'SCHOOL 321' 'SCHOOL 482' 'SCHOOL 185' 'SCHOOL 40'
'SCHOOL 271' 'SCHOOL 424' 'SCHOOL 136' 'SCHOOL 501' 'SCHOOL 197'
'SCHOOL 240' 'SCHOOL 326' 'SCHOOL 267' 'SCHOOL 204' 'SCHOOL 146'
'SCHOOL 260' 'SCHOOL 318' 'SCHOOL 286' 'SCHOOL 51' 'SCHOOL 316'
'SCHOOL 431' 'SCHOOL 244' 'SCHOOL 337' 'SCHOOL 36' 'SCHOOL 181'
'SCHOOL 459' 'SCHOOL 466' 'SCHOOL 256' 'SCHOOL 422' 'SCHOOL 92'
'SCHOOL 238' 'SCHOOL 350' 'SCHOOL 508' 'SCHOOL 287' 'SCHOOL 76'
'SCHOOL 289' 'SCHOOL 178' 'SCHOOL 469' 'SCHOOL 325' 'SCHOOL 86'
'SCHOOL 457' 'SCHOOL 294' 'SCHOOL 492' 'SCHOOL 21' 'SCHOOL 518'
'SCHOOL 341' 'SCHOOL 464' 'SCHOOL 429' 'SCHOOL 381' 'SCHOOL 399'
'SCHOOL 522' 'SCHOOL 32' 'SCHOOL 198' 'SCHOOL 235' 'SCHOOL 455'
'SCHOOL 280' 'SCHOOL 487' 'SCHOOL 463' 'SCHOOL 203' 'SCHOOL 253'
'SCHOOL 338' 'SCHOOL 186' 'SCHOOL 361' 'SCHOOL 67' 'SCHOOL 81'
'SCHOOL 403' 'SCHOOL 39' 'SCHOOL 182' 'SCHOOL 112' 'SCHOOL 56'
'SCHOOL 207' 'SCHOOL 313' 'SCHOOL 227' 'SCHOOL 249' 'SCHOOL 134'
'SCHOOL 371' 'SCHOOL 63' 'SCHOOL 370' 'SCHOOL 520' 'SCHOOL 257'
'SCHOOL 239' 'SCHOOL 80' 'SCHOOL 106' 'SCHOOL 300' 'SCHOOL 394'
'SCHOOL 524' 'SCHOOL 176' 'SCHOOL 220' 'SCHOOL 447' 'SCHOOL 265'
'SCHOOL 527' 'SCHOOL 462' 'SCHOOL 510' 'SCHOOL 533' 'SCHOOL 222'
'SCHOOL 218' 'SCHOOL 425' 'SCHOOL 441' 'SCHOOL 332' 'SCHOOL 213'
'SCHOOL 142' 'SCHOOL 160' 'SCHOOL 190' 'SCHOOL 344' 'SCHOOL 495'
'SCHOOL 493' 'SCHOOL 382' 'SCHOOL 174' 'SCHOOL 385' 'SCHOOL 451'
'SCHOOL 122' 'SCHOOL 473' 'SCHOOL 161' 'SCHOOL 363' 'SCHOOL 138'
'SCHOOL 470' 'SCHOOL 497' 'SCHOOL 390' 'SCHOOL 354' 'SCHOOL 359'
'SCHOOL 335' 'SCHOOL 90' 'SCHOOL 170' 'SCHOOL 251' 'SCHOOL 372'
'SCHOOL 496' 'SCHOOL 169' 'SCHOOL 324' 'SCHOOL 488' 'SCHOOL 409'
'SCHOOL 388' 'SCHOOL 215' 'SCHOOL 302' 'SCHOOL 167' 'SCHOOL 357'
'SCHOOL 415' 'SCHOOL 436' 'SCHOOL 432' 'SCHOOL 231' 'SCHOOL 195'
'SCHOOL 109' 'SCHOOL 201' 'SCHOOL 413' 'SCHOOL 526' 'SCHOOL 291'
'SCHOOL 60' 'SCHOOL 101' 'SCHOOL 322' 'SCHOOL 420' 'SCHOOL 499'
'SCHOOL 228' 'SCHOOL 212' 'SCHOOL 247' 'SCHOOL 476' 'SCHOOL 68'
'SCHOOL 468' 'SCHOOL 254' 'SCHOOL 438' 'SCHOOL 379' 'SCHOOL 512'
'SCHOOL 297' 'SCHOOL 481' 'SCHOOL 490' 'SCHOOL 77' 'SCHOOL 113'
'SCHOOL 461' 'SCHOOL 358' 'SCHOOL 264' 'SCHOOL 517' 'SCHOOL 404'
'SCHOOL 511' 'SCHOOL 418' 'SCHOOL 530' 'SCHOOL 444' 'SCHOOL 374'
'SCHOOL 484' 'SCHOOL 214' 'SCHOOL 329' 'SCHOOL 525' 'SCHOOL 194'
'SCHOOL 498' 'SCHOOL 458' 'SCHOOL 348' 'SCHOOL 509' 'SCHOOL 383'
'SCHOOL 241' 'SCHOOL 411' 'SCHOOL 124' 'SCHOOL 320' 'SCHOOL 408'
'SCHOOL 117' 'SCHOOL 206' 'SCHOOL 471' 'SCHOOL 349' 'SCHOOL 248'
'SCHOOL 479' 'SCHOOL 391' 'SCHOOL 507' 'SCHOOL 105' 'SCHOOL 347'
'SCHOOL 402' 'SCHOOL 494' 'SCHOOL 147' 'SCHOOL 387' 'SCHOOL 59'
'SCHOOL 103' 'SCHOOL 430' 'SCHOOL 303' 'SCHOOL 187' 'SCHOOL 433'
'SCHOOL 94' 'SCHOOL 102' 'SCHOOL 503' 'SCHOOL 172' 'SCHOOL 233'
'SCHOOL 96' 'SCHOOL 475' 'SCHOOL 299' 'SCHOOL 58' 'SCHOOL 217'
'SCHOOL 435' 'SCHOOL 380' 'SCHOOL 375' 'SCHOOL 392' 'SCHOOL 165'
'SCHOOL 421' 'SCHOOL 368' 'SCHOOL 118' 'SCHOOL 439' 'SCHOOL 255'
'SCHOOL 243' 'SCHOOL 450' 'SCHOOL 211' 'SCHOOL 437' 'SCHOOL 414']
====================================
GROSS_FIN_NEED : [ 0 570000 835920 203040 1083000 567000 861720 849660 1165740
15180 355200 1297440 1285860 1255740 1299360 806760 1360620 1146420
1192380 1410900 1504980 431700 1346640 1324740 1418100 1333200 1324500
1206780 699840 1187280 177960 625860 708900 471540 1316580 969180
909960 1068720 798840 1029600 100440 1042560 1247160 1340880 163620
1230600 1337460 38280 1171200 380820 1258200 709740 244140 988440
1010160 842940 522300 1140720 1109460 232980 1145880 769860 706500
221580 700320 830340 152460 1127160 1199340 819960 1489080 811500
502140 670920 1177560 640440 154680 700440 1111140 951480 326940
1117740 921360 982740 121020 37380 848520 858540 445560 1376280
739500 73860 847440 116460 983160 1254720 696120 1289880 902220
1167840 490500 1093200 521040 1176420 526740 798960 666900 1165920
574380 389220 983220 827940 173400 1306020 88920 1098600 1149960
1392240 626940 268680 1503900 655980 298620 462480 489900 882240
564660 1272420 1305120 282960 411540 783180 1102500 381240 706200
776340 664380 568800 1041720 194880 529920 201300 518460 807960
1263600 1243200 1404600 896100 1091940 585720 1171380 469020 1128120
924840 1043580 222900 717660 1076160 795240 373320 924000 1228320
1029900 647100 1199820 625500 575040 306480 336240 794340 842220
493260 144780 679260 467580 1316700 1255140 937200 740880 1230000
438540 821580 580440 365940 1001280 1022460 485400 1201500 1138020
1186860 728040 29820 571260 1017960 488880 1186200 291360 1015680
1038540 803520 1300620 1245480 542640 1281120 891540 1206480 794400
1122240 698040 1110720 329040 349860 873060 1158060 499020 717840
289800 1201260 718200 532740 1176300 118740 593040 1410180 1409460
626400 392220 1159620 744840 1225500 921540 553320 219240 311160
809760 1049520 13920 1122120 853800 722640 802860 10020 864900
196260 460320 898860 1283280 970080 1144200 872760 244320 1005540
482520 855840 1034820 1096440 357900 774540 145560 1138980 358860
1173540 353700 1254240 243720 876480 218160 1024800 82500 647160
1296900 703440 595080 753840 666660 756300 848700 896280 225240
1308360 217380 668940 1186500 695220 682620 845520 151920 1154640
910980 780420 1134480 394380 1203120 607380 207660 916680 142980
917880 1006860 1091040 254880 869760 573960 619440 566040 1329000
1230900 1411800 897900 1076220 266700 813780 895320 1156200 661740
1022520 1280820 118260 1131180 1140120 672120 257100 390960 355380
370320 533100 577500 586440 1340940 1312260 1388100 175500 738300
1075020 996000 809940 851220 1381320 770700 1040820 86220 1175880
170640 1250160 1031520 632862 945300 695700 540480 260820 1010640
572400 961200 872640 1052640 772140 788580 189420 419460 1222680
461760 1084680 156660 930720 126780 603360 470160 1092720 990600
198900 1297020 1187580 747180 984000 585120 1056120 855660 1246200
1058640 407940 67620 516360 1236960 929400 18540 773460 1011660
634380 1008000 1073220 156480 936180 951780 726900 978600 280140
1300680 607560 1291440 735840 974040 1053480 1152540 984420 473880
991380 229200 946560 580380 1151640 1056540 1315800 1410540 672540
1417380 855960 43620 1190760 701520 1166100 1407840 1063080 1053600
106380 562800 966300 606120 1357200 931980 1194540 905940 286980
1241520 996480 35700 123540 727680 282780 533640 832920 825000
516240 1252500 857580 772440 1257540 240060 298380 722340 559020
1288860 1248360 447780 632040 1175400 436920 66420 913500 479280
467340 791940 972360 740100 1056780 907380 816720 738480 228120
36660 524280 562440 1165560 606840 1183080 301440 423180 1185960
968460 1338540 1011780 785220 1012320 1416780 772500 1109340 1141800
1096740 119520 1378080 1559220 682200 167940 141720 86580 576540
254760 1180020 1325460 1933800 862140 250140 1314420 1230660 819600
1200600 448140 1285920 162600 365520 89760 1206180 907860 1377780
161880 516120 816480 745980 756720 703260 547920 1425300 1214880
689760 184620 493500 479700 446640 1131600 578640 1352160 1130760
298260 628200 227460 1350060 1233900 774960 237000 1172040 575220
866820 988560 852060 347040 1197300 989400 520620 1183440 1299540
895620 1413240 504420 875340 1016880 559620 304200 761760 766560
1264860 551940 692760 1171800 1349880 845700 800100 864660 1407060
1131960 596820 1141980 118140 673800 800400 1016760 492360 681480
654660 1412520 394800 1025220 497640 702600 1354380 935040 1002540
711720 490440 384900 1087320 1284780 871980 903300 802020 1276140
1214100 484320 708720 164460 1160400 1111500 47220 189060 1160040
567060 719340 861060 433740 720060 192180 959280 413460 1260000
1109400 1038120 133260 1169820 507840 594240 729360 570780 258960
1334940 1004820 1267380 1086060 875880 867060 840960 728640 771480
1321020 271440 633480 784980 222540 504000 1341600 1144800 924900
1073100 738960 161520 1304940 910320 1061040 747900 1009380 1336500
568260 791880 858660 212280 176580 1396560 1032840 497220 285360
650640 973320 203400 742800 722520 819720 934560 899640 854580
478380 882060 1308480 1136220 261600 430380 301080 965040 461400
625020 1064160 314640 733860 604380 460740 567120 1057860 836880
953880 866040 846120 1093260 1488000 730740 621420 902760 91740
598320 47100 1002480 28320 262140 704460 1226400 1011240 1032180
1310100 1149420 1016040 1102380 484500 751980 945840 654600 454260
1233420 1140780 1149180 389280 1175100 1189920 1203660 537000 1169700
1412460 81120 414060 1355760 1165860 945420 856560 141900 1028520
1140660 811140 1331940 1236840 1282860 266460 1010760 622440 1163220
1225080 336960 185100 1190520 940500 1491660 582600 479640 1192440
671700 1006800 1197900 576240 491700 25500 346200 1459020 1199100
581340 424440 487500 1271820 372360 1145160 597420 114000 340200
766680 1354740 1461780 1077480 594540 989340 467100 814260 1019220
1130040 435000 269700 794640 941340 987000 1124940 970800 211140
569220 440940 630540 1152120 276780 32640 796200 1260840 1257780
1104060 42420 72840 231060 1017720 1073760 1322700 2124900 1175760
550260 603660 159840 129540 1299180 1153380 860520 290880 1383900
1102740 458460 974400 865620 813600 38820 204420 1213320 453660
812220 25800 239760 339840 1102440 1072200 132780 53340 1135560
1341360 1079820 1344420 572340 485040 331200 1317900 744420 754260
1062480 493320 636300 268320 2880 403560 965400 1244580 939780
645240 631020 22380 250920 1114020 1372560 864480 894060 324300
14880 899580 233040 829380 565560 828000 355140 1029720 521700
16860 698580 570180 1292340 891480 909480 665100 1182180 1213980
798480 1716660 115140 53880 1319640 1316220 1037700 941760 1274580
1079760 1028040 1141020 759540 491460 1110840 926280 907440 867900
417900 787140 699300 299820 1009440 1103340 1157700 1049220 635220
583620 427320 575880 1222080 1288080 1049760 610740 949800 911100
1143780 444240 619200 866400 648960 689100 552960 831720 115500
1058580 60180 1724820 123840 536700 1216080 308340 632940 244560]
====================================
COST_OF_ATTEND : [ 0 1355760 1264860 1189920 1255740 1324500 835680 1176420 403680
1151640 1410180 1146420 224640 1504980 199440 1297440 326880 1285860
1418100 150540 301080 902760 1245480 1306020 1409460 451620 1346640
1025700 1233900 1417380 336960 216480 1171380 1187280 1160520 1316580
1201500 389520 1138980 299160 259680 1459020 1296900 949560 945840
1489080 269100 1466460 134580 915000 983160 108960 1392240 112320
326640 1010640 240120 939000 933000 1270680 1144200 360180 349860
392220 339840 217920 357900 831720 860640 848700 99720 955500
1309560 108240 840540 869760 1393560 861060 908160 852480 208560
1236960 941700 894060 1071780 2082900 1025160 342720 1559220 369000
278280 845940 1804860 588120 1474800 976320 733860 315960 352860
331740 457560 417900 766680 1019760 372600 1019220 970800 120060
905340 2124900 807420 830100 347580 1119300 392400 472980 1133940
1432500 833160 1862040]
====================================
EST_FAM_CONTRIBUTION : [ 0 785760 519840 ... 2177700 1505400 1288980]
====================================
UNMET_NEED : [ 0. 459300. 278340. ... 34560. -49860. -289140.]
====================================
len(demo4)
8
# It shows sum of missing values for each variables
cf.isnull().sum()
STUDENT IDENTIFIER 0 STDNT_AGE 0 STDNT_GENDER 0 STDNT_BACKGROUND 0 IN_STATE_FLAG 0 INTERNATIONAL_STS 0 STDNT_MAJOR 0 STDNT_MINOR 0 STDNT_TEST_ENTRANCE1 2294 STDNT_TEST_ENTRANCE2 908 STDNT_TEST_ENTRANCE_COMB 518 FIRST_TERM 0 CORE_COURSE_NAME_1_F 0 CORE_COURSE_GRADE_1_F 0 CORE_COURSE_NAME_2_F 99 CORE_COURSE_GRADE_2_F 99 CORE_COURSE_NAME_3_F 565 CORE_COURSE_GRADE_3_F 565 CORE_COURSE_NAME_4_F 1597 CORE_COURSE_GRADE_4_F 1597 CORE_COURSE_NAME_5_F 2755 CORE_COURSE_GRADE_5_F 2755 CORE_COURSE_NAME_6_F 3272 CORE_COURSE_GRADE_6_F 3272 SECOND_TERM 0 CORE_COURSE_NAME_1_S 157 CORE_COURSE_GRADE_1_S 232 CORE_COURSE_NAME_2_S 439 CORE_COURSE_GRADE_2_S 439 CORE_COURSE_NAME_3_S 1038 CORE_COURSE_GRADE_3_S 1038 CORE_COURSE_NAME_4_S 2045 CORE_COURSE_GRADE_4_S 2045 CORE_COURSE_NAME_5_S 2950 CORE_COURSE_GRADE_5_S 2950 CORE_COURSE_NAME_6_S 3319 CORE_COURSE_GRADE_6_S 3319 HOUSING_STS 0 RETURNED_2ND_YR 0 DISTANCE_FROM_HOME 25 HIGH_SCHL_GPA 53 HIGH_SCHL_NAME 1 FATHER_HI_EDU_CD 432 FATHER_HI_EDU_DESC 0 MOTHER_HI_EDU_CD 489 MOTHER_HI_EDU_DESC 0 DEGREE_GROUP_CD 0 DEGREE_GROUP_DESC 0 FIRST_TERM_ATTEMPT_HRS 0 FIRST_TERM_EARNED_HRS 0 SECOND_TERM_ATTEMPT_HRS 206 SECOND_TERM_EARNED_HRS 209 GROSS_FIN_NEED 0 COST_OF_ATTEND 0 EST_FAM_CONTRIBUTION 0 UNMET_NEED 0 dtype: int64
#It is group data with two different variables named under att
att=cf.groupby('RETURNED_2ND_YR')
# This shows the data type of that column
type('RETURNED_2ND_YR')
str
#it shows mean of all the variables wrt to 0 and 1
att.mean()
| STUDENT IDENTIFIER | STDNT_AGE | STDNT_TEST_ENTRANCE1 | STDNT_TEST_ENTRANCE2 | STDNT_TEST_ENTRANCE_COMB | FIRST_TERM | SECOND_TERM | DISTANCE_FROM_HOME | HIGH_SCHL_GPA | FATHER_HI_EDU_CD | MOTHER_HI_EDU_CD | FIRST_TERM_ATTEMPT_HRS | FIRST_TERM_EARNED_HRS | SECOND_TERM_ATTEMPT_HRS | SECOND_TERM_EARNED_HRS | GROSS_FIN_NEED | COST_OF_ATTEND | EST_FAM_CONTRIBUTION | UNMET_NEED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RETURNED_2ND_YR | |||||||||||||||||||
| 0 | 7.899963e+06 | 18.015214 | 20.550000 | 970.705882 | 999.403974 | 200787.391425 | 200881.391425 | 112.214586 | 3.131802 | 2.527419 | 2.535000 | 13.988935 | 11.809129 | 13.906250 | 11.466912 | 307243.128631 | 549032.199170 | 315265.643154 | 90679.460581 |
| 1 | 7.896397e+06 | 17.988420 | 20.690531 | 963.647830 | 997.322212 | 200774.193500 | 200868.193500 | 96.506762 | 3.221768 | 2.569421 | 2.580701 | 13.985058 | 12.311916 | 14.363774 | 12.725727 | 302365.237206 | 552209.069854 | 352636.787449 | 67235.153455 |
The describe() method is used for calculating some statistical data like percentile, mean and std of the numerical values of the Series or DataFrame. It analyzes both numeric and object series and also the DataFrame column sets of mixed data types
cf.describe()
| STUDENT IDENTIFIER | STDNT_AGE | STDNT_TEST_ENTRANCE1 | STDNT_TEST_ENTRANCE2 | STDNT_TEST_ENTRANCE_COMB | FIRST_TERM | SECOND_TERM | RETURNED_2ND_YR | DISTANCE_FROM_HOME | HIGH_SCHL_GPA | FATHER_HI_EDU_CD | MOTHER_HI_EDU_CD | FIRST_TERM_ATTEMPT_HRS | FIRST_TERM_EARNED_HRS | SECOND_TERM_ATTEMPT_HRS | SECOND_TERM_EARNED_HRS | GROSS_FIN_NEED | COST_OF_ATTEND | EST_FAM_CONTRIBUTION | UNMET_NEED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.400000e+03 | 3400.000000 | 1106.000000 | 2492.000000 | 2882.000000 | 3400.000000 | 3400.000000 | 3400.000000 | 3375.000000 | 3347.000000 | 2968.000000 | 2911.000000 | 3400.000000 | 3400.000000 | 3194.000000 | 3191.000000 | 3.400000e+03 | 3.400000e+03 | 3.400000e+03 | 3.400000e+03 |
| mean | 7.897155e+06 | 17.994118 | 20.660036 | 965.092295 | 997.758501 | 200777.000000 | 200871.000000 | 0.787353 | 99.825185 | 3.202738 | 2.560647 | 2.571281 | 13.985882 | 12.205000 | 14.285848 | 12.511125 | 3.034025e+05 | 5.515335e+05 | 3.446899e+05 | 7.222052e+04 |
| std | 8.083421e+04 | 0.552052 | 3.234734 | 209.088633 | 156.268221 | 170.001298 | 170.001298 | 0.409240 | 235.196984 | 0.455337 | 0.681273 | 0.611729 | 1.554286 | 2.986294 | 2.104387 | 3.470492 | 4.836255e+05 | 6.064403e+05 | 7.871878e+05 | 2.795679e+05 |
| min | 7.755837e+06 | 16.000000 | 8.000000 | 350.000000 | 530.000000 | 200508.000000 | 200602.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 9.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -1.212072e+06 |
| 25% | 7.827039e+06 | 18.000000 | 18.000000 | 900.000000 | 910.000000 | 200608.000000 | 200702.000000 | 1.000000 | 69.000000 | 2.870000 | 2.000000 | 2.000000 | 13.000000 | 11.000000 | 13.000000 | 11.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 7.899052e+06 | 18.000000 | 20.000000 | 990.000000 | 990.000000 | 200808.000000 | 200902.000000 | 1.000000 | 69.000000 | 3.200000 | 3.000000 | 3.000000 | 14.000000 | 13.000000 | 14.000000 | 13.000000 | 0.000000e+00 | 1.505400e+05 | 0.000000e+00 | 0.000000e+00 |
| 75% | 7.964008e+06 | 18.000000 | 22.000000 | 1100.000000 | 1110.000000 | 200908.000000 | 201002.000000 | 1.000000 | 138.000000 | 3.560000 | 3.000000 | 3.000000 | 15.000000 | 15.000000 | 16.000000 | 15.000000 | 5.995800e+05 | 1.192815e+06 | 3.241800e+05 | 1.876155e+05 |
| max | 8.037098e+06 | 26.000000 | 32.000000 | 1490.000000 | 1510.000000 | 201008.000000 | 201102.000000 | 1.000000 | 5932.000000 | 4.000000 | 4.000000 | 4.000000 | 21.000000 | 21.000000 | 23.000000 | 23.000000 | 2.124900e+06 | 2.124900e+06 | 5.999940e+06 | 1.632660e+06 |
# .value counts shows the counts of those variables like It shows number of Female and male students here
cf['STDNT_GENDER'].value_counts()
F 2140 M 1260 Name: STDNT_GENDER, dtype: int64
# It is used to group the data to see number of students left and returned
att=cf.groupby('RETURNED_2ND_YR')
att_count=cf.groupby('RETURNED_2ND_YR').count()
# it shows the number of left students under different categories
att_count
| STUDENT IDENTIFIER | STDNT_AGE | STDNT_GENDER | STDNT_BACKGROUND | IN_STATE_FLAG | INTERNATIONAL_STS | STDNT_MAJOR | STDNT_MINOR | STDNT_TEST_ENTRANCE1 | STDNT_TEST_ENTRANCE2 | ... | DEGREE_GROUP_CD | DEGREE_GROUP_DESC | FIRST_TERM_ATTEMPT_HRS | FIRST_TERM_EARNED_HRS | SECOND_TERM_ATTEMPT_HRS | SECOND_TERM_EARNED_HRS | GROSS_FIN_NEED | COST_OF_ATTEND | EST_FAM_CONTRIBUTION | UNMET_NEED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RETURNED_2ND_YR | |||||||||||||||||||||
| 0 | 723 | 723 | 723 | 723 | 723 | 723 | 723 | 723 | 240 | 510 | ... | 723 | 723 | 723 | 723 | 544 | 544 | 723 | 723 | 723 | 723 |
| 1 | 2677 | 2677 | 2677 | 2677 | 2677 | 2677 | 2677 | 2677 | 866 | 1982 | ... | 2677 | 2677 | 2677 | 2677 | 2650 | 2647 | 2677 | 2677 | 2677 | 2677 |
2 rows × 55 columns
# It will plot the bar chart to see how many students left due to distane from home
import matplotlib.pyplot as plt
plt.bar(att_count.index.values,att_count['DISTANCE_FROM_HOME'])
plt.xlabel('Students left College')
plt.ylabel('Number of students')
plt.title ('DISTANCE_FROM_HOME')
plt.show()
# It gives count for number of students left and return
cf.RETURNED_2ND_YR.value_counts()
1 2677 0 723 Name: RETURNED_2ND_YR, dtype: int64
# Its new data frame to see all the students who left
cf_new=cf.query("RETURNED_2ND_YR=='0'")
# It shows first 10 left students data
cf_new.head(10)
| STUDENT IDENTIFIER | STDNT_AGE | STDNT_GENDER | STDNT_BACKGROUND | IN_STATE_FLAG | INTERNATIONAL_STS | STDNT_MAJOR | STDNT_MINOR | STDNT_TEST_ENTRANCE1 | STDNT_TEST_ENTRANCE2 | ... | DEGREE_GROUP_CD | DEGREE_GROUP_DESC | FIRST_TERM_ATTEMPT_HRS | FIRST_TERM_EARNED_HRS | SECOND_TERM_ATTEMPT_HRS | SECOND_TERM_EARNED_HRS | GROSS_FIN_NEED | COST_OF_ATTEND | EST_FAM_CONTRIBUTION | UNMET_NEED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7808615 | 18 | F | BGD 1 | Y | N | Undeclared | N | NaN | 1150.0 | ... | B | Bachelors | 16 | 16 | 14.0 | 14.0 | 0 | 0 | 0 | 0.0 |
| 11 | 7884573 | 18 | F | BGD 1 | Y | N | Undeclared | N | NaN | 1050.0 | ... | B | Bachelors | 16 | 16 | 16.0 | 16.0 | 0 | 1189920 | 5163240 | -326880.0 |
| 17 | 8004193 | 17 | M | BGD 1 | N | N | Biology | N | NaN | 1040.0 | ... | B | Bachelors | 14 | 14 | 14.0 | 14.0 | 1083000 | 1324500 | 241500 | 476580.0 |
| 23 | 8007842 | 18 | F | BGD 1 | Y | N | Pre-Nursing | N | NaN | 580.0 | ... | B | Bachelors | 15 | 15 | 15.0 | 15.0 | 0 | 0 | 0 | 0.0 |
| 36 | 7799972 | 18 | F | BGD 1 | Y | N | Undeclared | N | NaN | 620.0 | ... | B | Bachelors | 18 | 18 | 14.0 | 14.0 | 0 | 0 | 0 | 0.0 |
| 45 | 7911769 | 18 | F | BGD 1 | Y | N | Psychology | N | NaN | 1130.0 | ... | B | Bachelors | 15 | 15 | 15.0 | 15.0 | 355200 | 1176420 | 821220 | 56040.0 |
| 56 | 7889645 | 18 | F | BGD 1 | Y | N | Early Childhood Education | N | 26.0 | NaN | ... | B | Bachelors | 14 | 10 | 12.0 | 12.0 | 0 | 0 | 0 | 0.0 |
| 73 | 7807583 | 18 | F | BGD 6 | Y | N | Undeclared | N | NaN | 1360.0 | ... | B | Bachelors | 16 | 16 | 15.0 | 15.0 | 0 | 0 | 0 | 0.0 |
| 84 | 7935064 | 18 | F | BGD 1 | Y | N | Undeclared | N | 27.0 | NaN | ... | B | Bachelors | 15 | 15 | 16.0 | 16.0 | 0 | 0 | 0 | 0.0 |
| 85 | 7805566 | 17 | F | BGD 1 | Y | N | Undeclared | N | NaN | 1150.0 | ... | B | Bachelors | 13 | 13 | 12.0 | 12.0 | 0 | 1264860 | 1471380 | -347580.0 |
10 rows × 56 columns
# It shows missing values for new data frame for left students
cf_new.isnull().sum()
STUDENT IDENTIFIER 0 STDNT_AGE 0 STDNT_GENDER 0 STDNT_BACKGROUND 0 IN_STATE_FLAG 0 INTERNATIONAL_STS 0 STDNT_MAJOR 0 STDNT_MINOR 0 STDNT_TEST_ENTRANCE1 483 STDNT_TEST_ENTRANCE2 213 STDNT_TEST_ENTRANCE_COMB 119 FIRST_TERM 0 CORE_COURSE_NAME_1_F 0 CORE_COURSE_GRADE_1_F 0 CORE_COURSE_NAME_2_F 20 CORE_COURSE_GRADE_2_F 20 CORE_COURSE_NAME_3_F 116 CORE_COURSE_GRADE_3_F 116 CORE_COURSE_NAME_4_F 330 CORE_COURSE_GRADE_4_F 330 CORE_COURSE_NAME_5_F 594 CORE_COURSE_GRADE_5_F 594 CORE_COURSE_NAME_6_F 694 CORE_COURSE_GRADE_6_F 694 SECOND_TERM 0 CORE_COURSE_NAME_1_S 125 CORE_COURSE_GRADE_1_S 189 CORE_COURSE_NAME_2_S 221 CORE_COURSE_GRADE_2_S 221 CORE_COURSE_NAME_3_S 315 CORE_COURSE_GRADE_3_S 315 CORE_COURSE_NAME_4_S 489 CORE_COURSE_GRADE_4_S 489 CORE_COURSE_NAME_5_S 652 CORE_COURSE_GRADE_5_S 652 CORE_COURSE_NAME_6_S 710 CORE_COURSE_GRADE_6_S 710 HOUSING_STS 0 RETURNED_2ND_YR 0 DISTANCE_FROM_HOME 10 HIGH_SCHL_GPA 15 HIGH_SCHL_NAME 0 FATHER_HI_EDU_CD 103 FATHER_HI_EDU_DESC 0 MOTHER_HI_EDU_CD 123 MOTHER_HI_EDU_DESC 0 DEGREE_GROUP_CD 0 DEGREE_GROUP_DESC 0 FIRST_TERM_ATTEMPT_HRS 0 FIRST_TERM_EARNED_HRS 0 SECOND_TERM_ATTEMPT_HRS 179 SECOND_TERM_EARNED_HRS 179 GROSS_FIN_NEED 0 COST_OF_ATTEND 0 EST_FAM_CONTRIBUTION 0 UNMET_NEED 0 dtype: int64
# It gives information of what are data types..int , float object
cf_new.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 723 entries, 0 to 3399 Data columns (total 56 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 STUDENT IDENTIFIER 723 non-null int64 1 STDNT_AGE 723 non-null int64 2 STDNT_GENDER 723 non-null object 3 STDNT_BACKGROUND 723 non-null object 4 IN_STATE_FLAG 723 non-null object 5 INTERNATIONAL_STS 723 non-null object 6 STDNT_MAJOR 723 non-null object 7 STDNT_MINOR 723 non-null object 8 STDNT_TEST_ENTRANCE1 240 non-null float64 9 STDNT_TEST_ENTRANCE2 510 non-null float64 10 STDNT_TEST_ENTRANCE_COMB 604 non-null float64 11 FIRST_TERM 723 non-null int64 12 CORE_COURSE_NAME_1_F 723 non-null object 13 CORE_COURSE_GRADE_1_F 723 non-null object 14 CORE_COURSE_NAME_2_F 703 non-null object 15 CORE_COURSE_GRADE_2_F 703 non-null object 16 CORE_COURSE_NAME_3_F 607 non-null object 17 CORE_COURSE_GRADE_3_F 607 non-null object 18 CORE_COURSE_NAME_4_F 393 non-null object 19 CORE_COURSE_GRADE_4_F 393 non-null object 20 CORE_COURSE_NAME_5_F 129 non-null object 21 CORE_COURSE_GRADE_5_F 129 non-null object 22 CORE_COURSE_NAME_6_F 29 non-null object 23 CORE_COURSE_GRADE_6_F 29 non-null object 24 SECOND_TERM 723 non-null int64 25 CORE_COURSE_NAME_1_S 598 non-null object 26 CORE_COURSE_GRADE_1_S 534 non-null object 27 CORE_COURSE_NAME_2_S 502 non-null object 28 CORE_COURSE_GRADE_2_S 502 non-null object 29 CORE_COURSE_NAME_3_S 408 non-null object 30 CORE_COURSE_GRADE_3_S 408 non-null object 31 CORE_COURSE_NAME_4_S 234 non-null object 32 CORE_COURSE_GRADE_4_S 234 non-null object 33 CORE_COURSE_NAME_5_S 71 non-null object 34 CORE_COURSE_GRADE_5_S 71 non-null object 35 CORE_COURSE_NAME_6_S 13 non-null object 36 CORE_COURSE_GRADE_6_S 13 non-null object 37 HOUSING_STS 723 non-null object 38 RETURNED_2ND_YR 723 non-null int64 39 DISTANCE_FROM_HOME 713 non-null float64 40 HIGH_SCHL_GPA 708 non-null float64 41 HIGH_SCHL_NAME 723 non-null object 42 FATHER_HI_EDU_CD 620 non-null float64 43 FATHER_HI_EDU_DESC 723 non-null object 44 MOTHER_HI_EDU_CD 600 non-null float64 45 MOTHER_HI_EDU_DESC 723 non-null object 46 DEGREE_GROUP_CD 723 non-null object 47 DEGREE_GROUP_DESC 723 non-null object 48 FIRST_TERM_ATTEMPT_HRS 723 non-null int64 49 FIRST_TERM_EARNED_HRS 723 non-null int64 50 SECOND_TERM_ATTEMPT_HRS 544 non-null float64 51 SECOND_TERM_EARNED_HRS 544 non-null float64 52 GROSS_FIN_NEED 723 non-null int64 53 COST_OF_ATTEND 723 non-null int64 54 EST_FAM_CONTRIBUTION 723 non-null int64 55 UNMET_NEED 723 non-null float64 dtypes: float64(10), int64(10), object(36) memory usage: 322.0+ KB
# It will put the missing values in descending order with their percentage for each variable
drp_col=pd.DataFrame(round((cf_new.isnull().sum()/cf_new.shape[0])*100,2).sort_values(ascending=False)).reset_index()
# It will show number of rows and columns for left std data frame
cf_new.shape
(723, 56)
# It shows the percentage of missing values in descending orders
drp_col
| index | 0 | |
|---|---|---|
| 0 | CORE_COURSE_NAME_6_S | 98.20 |
| 1 | CORE_COURSE_GRADE_6_S | 98.20 |
| 2 | CORE_COURSE_NAME_6_F | 95.99 |
| 3 | CORE_COURSE_GRADE_6_F | 95.99 |
| 4 | CORE_COURSE_NAME_5_S | 90.18 |
| 5 | CORE_COURSE_GRADE_5_S | 90.18 |
| 6 | CORE_COURSE_NAME_5_F | 82.16 |
| 7 | CORE_COURSE_GRADE_5_F | 82.16 |
| 8 | CORE_COURSE_NAME_4_S | 67.63 |
| 9 | CORE_COURSE_GRADE_4_S | 67.63 |
| 10 | STDNT_TEST_ENTRANCE1 | 66.80 |
| 11 | CORE_COURSE_NAME_4_F | 45.64 |
| 12 | CORE_COURSE_GRADE_4_F | 45.64 |
| 13 | CORE_COURSE_NAME_3_S | 43.57 |
| 14 | CORE_COURSE_GRADE_3_S | 43.57 |
| 15 | CORE_COURSE_GRADE_2_S | 30.57 |
| 16 | CORE_COURSE_NAME_2_S | 30.57 |
| 17 | STDNT_TEST_ENTRANCE2 | 29.46 |
| 18 | CORE_COURSE_GRADE_1_S | 26.14 |
| 19 | SECOND_TERM_EARNED_HRS | 24.76 |
| 20 | SECOND_TERM_ATTEMPT_HRS | 24.76 |
| 21 | CORE_COURSE_NAME_1_S | 17.29 |
| 22 | MOTHER_HI_EDU_CD | 17.01 |
| 23 | STDNT_TEST_ENTRANCE_COMB | 16.46 |
| 24 | CORE_COURSE_NAME_3_F | 16.04 |
| 25 | CORE_COURSE_GRADE_3_F | 16.04 |
| 26 | FATHER_HI_EDU_CD | 14.25 |
| 27 | CORE_COURSE_NAME_2_F | 2.77 |
| 28 | CORE_COURSE_GRADE_2_F | 2.77 |
| 29 | HIGH_SCHL_GPA | 2.07 |
| 30 | DISTANCE_FROM_HOME | 1.38 |
| 31 | FIRST_TERM | 0.00 |
| 32 | CORE_COURSE_GRADE_1_F | 0.00 |
| 33 | STDNT_MINOR | 0.00 |
| 34 | STDNT_BACKGROUND | 0.00 |
| 35 | CORE_COURSE_NAME_1_F | 0.00 |
| 36 | STDNT_GENDER | 0.00 |
| 37 | STDNT_AGE | 0.00 |
| 38 | INTERNATIONAL_STS | 0.00 |
| 39 | STDNT_MAJOR | 0.00 |
| 40 | IN_STATE_FLAG | 0.00 |
| 41 | UNMET_NEED | 0.00 |
| 42 | SECOND_TERM | 0.00 |
| 43 | EST_FAM_CONTRIBUTION | 0.00 |
| 44 | HOUSING_STS | 0.00 |
| 45 | RETURNED_2ND_YR | 0.00 |
| 46 | HIGH_SCHL_NAME | 0.00 |
| 47 | FATHER_HI_EDU_DESC | 0.00 |
| 48 | MOTHER_HI_EDU_DESC | 0.00 |
| 49 | DEGREE_GROUP_CD | 0.00 |
| 50 | DEGREE_GROUP_DESC | 0.00 |
| 51 | FIRST_TERM_ATTEMPT_HRS | 0.00 |
| 52 | FIRST_TERM_EARNED_HRS | 0.00 |
| 53 | GROSS_FIN_NEED | 0.00 |
| 54 | COST_OF_ATTEND | 0.00 |
| 55 | STUDENT IDENTIFIER | 0.00 |
## it will drop the coloumns first 20 rows, as these are having high number of misssing values, which will bias our analysis
c=drp_col.nlargest(19,0)['index'].tolist()
# It will show number of rows and columns
drp_col.shape
(56, 2)
# It will create new data frame for selected data, here we have dropped columns which are above 25%
drop=drp_col[0:18]
It shows the data that we are not using for analysis We have dropped them as many are categorical & also not directly relevant for the analysis
drop
| index | 0 | |
|---|---|---|
| 0 | CORE_COURSE_NAME_6_S | 98.20 |
| 1 | CORE_COURSE_GRADE_6_S | 98.20 |
| 2 | CORE_COURSE_NAME_6_F | 95.99 |
| 3 | CORE_COURSE_GRADE_6_F | 95.99 |
| 4 | CORE_COURSE_NAME_5_S | 90.18 |
| 5 | CORE_COURSE_GRADE_5_S | 90.18 |
| 6 | CORE_COURSE_NAME_5_F | 82.16 |
| 7 | CORE_COURSE_GRADE_5_F | 82.16 |
| 8 | CORE_COURSE_NAME_4_S | 67.63 |
| 9 | CORE_COURSE_GRADE_4_S | 67.63 |
| 10 | STDNT_TEST_ENTRANCE1 | 66.80 |
| 11 | CORE_COURSE_NAME_4_F | 45.64 |
| 12 | CORE_COURSE_GRADE_4_F | 45.64 |
| 13 | CORE_COURSE_NAME_3_S | 43.57 |
| 14 | CORE_COURSE_GRADE_3_S | 43.57 |
| 15 | CORE_COURSE_GRADE_2_S | 30.57 |
| 16 | CORE_COURSE_NAME_2_S | 30.57 |
| 17 | STDNT_TEST_ENTRANCE2 | 29.46 |
# new data frame created by name vp for drop by Name
vp=cf_new.drop(drop['index'], axis=1)
This data frame is only for the student who left college
# It shows the number of columns under data frame vp
vp.columns
Index(['STUDENT IDENTIFIER', 'STDNT_AGE', 'STDNT_GENDER', 'STDNT_BACKGROUND',
'IN_STATE_FLAG', 'INTERNATIONAL_STS', 'STDNT_MAJOR', 'STDNT_MINOR',
'STDNT_TEST_ENTRANCE_COMB', 'FIRST_TERM', 'CORE_COURSE_NAME_1_F',
'CORE_COURSE_GRADE_1_F', 'CORE_COURSE_NAME_2_F',
'CORE_COURSE_GRADE_2_F', 'CORE_COURSE_NAME_3_F',
'CORE_COURSE_GRADE_3_F', 'SECOND_TERM', 'CORE_COURSE_NAME_1_S',
'CORE_COURSE_GRADE_1_S', 'HOUSING_STS', 'RETURNED_2ND_YR',
'DISTANCE_FROM_HOME', 'HIGH_SCHL_GPA', 'HIGH_SCHL_NAME',
'FATHER_HI_EDU_CD', 'FATHER_HI_EDU_DESC', 'MOTHER_HI_EDU_CD',
'MOTHER_HI_EDU_DESC', 'DEGREE_GROUP_CD', 'DEGREE_GROUP_DESC',
'FIRST_TERM_ATTEMPT_HRS', 'FIRST_TERM_EARNED_HRS',
'SECOND_TERM_ATTEMPT_HRS', 'SECOND_TERM_EARNED_HRS', 'GROSS_FIN_NEED',
'COST_OF_ATTEND', 'EST_FAM_CONTRIBUTION', 'UNMET_NEED'],
dtype='object')
vp.shape
(723, 38)
# It shows Gender of left students
vp['STDNT_GENDER'].value_counts()
F 475 M 248 Name: STDNT_GENDER, dtype: int64
#profile report for the Entire Data
import pandas as pd,os,numpy as np,pandas_profiling as pf
pf.ProfileReport(cf)